
Assignment 1

Report for the first assignment of Effective MLOps: Model Development course.


Problem and Dataset

  • The problem I picked is an ordinal regression task: predicting the review rating of books on Goodreads (a Kaggle competition)
  • The data consists of approximately 0.9M book reviews, each containing the book, the author, the review text, and the review's stats. For a full description, check out the dataset information on Kaggle

EDA Raw Data


(Table: preview of the raw reviews, with columns user_id, book_id, review_id, rating, review_text, date_added, date_updated, read_at, started_at, n_votes, n_comments)
  • The book_id feature is read as an integer (ordinal) although it is actually categorical
  • The date columns (i.e. date_added, date_updated, read_at, started_at) are strings instead of datetime
  • The features read_at and started_at have many missing entries (a quick sketch of these checks follows this list)
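A minimal sketch of the checks behind these observations, assuming the reviews are loaded into a pandas DataFrame from a local CSV (the file name is an assumption) and that missing dates appear as NaN:

```python
import pandas as pd

# Load the raw reviews; the file name is an assumption, adjust to the Kaggle download
df = pd.read_csv("goodreads_train.csv")

# book_id is read as an integer and the date columns as plain strings
print(df.dtypes[["book_id", "date_added", "date_updated", "read_at", "started_at"]])

# read_at and started_at have many missing entries
print(df[["read_at", "started_at"]].isna().sum())
```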

(Chart: reviews grouped by rating, showing the review count and the average n_votes and n_comments per rating)
  • Rating 4 is the most common, so the naïve baseline is to set every review's rating to 4 and compute the F1 score
  • People tend to leave a comment when the review is either really bad or really good (the aggregation behind the chart is sketched below)
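A sketch of the per-rating aggregation behind the chart, reusing the DataFrame df from above; whether n_comments is averaged or summed in the original chart is an assumption:

```python
# Per-rating summary: number of reviews, average votes, and average comments
summary = df.groupby("rating").agg(
    count=("review_id", "size"),
    avg_n_votes=("n_votes", "mean"),
    avg_n_comments=("n_comments", "mean"),
)
print(summary)
```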

Data Processing

  • Take the absolute value of n_votes and n_comments (counts cannot be negative)
  • Fill missing read_at values with date_added
  • Convert the date string columns to pandas datetime
  • Fill the remaining missing values with the mode of the respective column
  • Derive additional features (missing_started_at, reading_duration, review_length, spoiler, hour, month, dayofweek, year, ...)
  • Drop the review_text column
  • Convert the id features to the category dtype (a pandas sketch of these steps follows this list)
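A minimal sketch of this preprocessing, assuming the DataFrame df from the EDA sketch above and that missing values appear as NaN; only a subset of the listed derived features is shown:

```python
import pandas as pd

DATE_COLS = ["date_added", "date_updated", "read_at", "started_at"]
ID_COLS = ["user_id", "book_id", "review_id"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Counts cannot be negative
    df["n_votes"] = df["n_votes"].abs()
    df["n_comments"] = df["n_comments"].abs()

    # Flag missing started_at before any imputation
    df["missing_started_at"] = df["started_at"].isna().astype(int)

    # Fill missing read_at with date_added, then parse the date strings
    df["read_at"] = df["read_at"].fillna(df["date_added"])
    for col in DATE_COLS:
        df[col] = pd.to_datetime(df[col], errors="coerce", utc=True)

    # A subset of the derived features listed above
    df["reading_duration"] = (df["read_at"] - df["started_at"]).dt.days
    df["review_length"] = df["review_text"].str.len()
    df["hour"] = df["date_added"].dt.hour
    df["month"] = df["date_added"].dt.month
    df["dayofweek"] = df["date_added"].dt.dayofweek
    df["year"] = df["date_added"].dt.year

    # Fill the remaining missing values with the column mode
    for col in df.columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode()[0])

    # Drop the raw text and cast the id features to category
    df = df.drop(columns=["review_text"])
    for col in ID_COLS:
        df[col] = df[col].astype("category")

    return df

processed = preprocess(df)
```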

Naïve Baseline

  • Set all ratings in the validation set to 4 (sketched below)
  • F1 score: 0.08
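A sketch of this naïve baseline, assuming the processed DataFrame from the preprocessing sketch above and a simple 80/20 split; macro averaging is an assumption, and the exact averaging should follow the competition metric:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# processed: the DataFrame returned by preprocess() above
train_df, valid_df = train_test_split(processed, test_size=0.2, random_state=42)

y_val = valid_df["rating"].to_numpy()
y_pred = np.full_like(y_val, 4)  # predict the most common rating for every review

print(f1_score(y_val, y_pred, average="macro"))
```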

Baseline Model

  • LightGBM Classifier (default parameters)
  • Below are the key metric (F1 score), the training charts, and the predictions and feature importance tables; a training sketch follows this list.
  • Conclusions:
    • The F1 score improves drastically compared with the naïve baseline, even though the model was trained with the default hyperparameters
    • Looking at the evolution of the F1 score and the multi logloss over the iterations, one can see that they have not reached a plateau yet. Hence, increasing the number of iterations would likely improve the performance of the model
    • The feature importance bar plot suggests that book_id and user_id are the most important features, implying that certain users and books are reviewed either really well or really poorly
  • F1 score: 0.38
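A minimal training sketch for this baseline, reusing the train/validation split from the naïve baseline above; leaving review_id and the raw datetime columns out of the features, and reporting macro-averaged F1, are assumptions on my part:

```python
import lightgbm as lgb
from sklearn.metrics import f1_score

target = "rating"
# review_id is a unique identifier and the raw datetime columns are not numeric/categorical,
# so both are excluded from the model inputs (assumption)
drop_cols = {target, "review_id", "date_added", "date_updated", "read_at", "started_at"}
features = [c for c in train_df.columns if c not in drop_cols]

model = lgb.LGBMClassifier()  # default hyperparameters
model.fit(
    train_df[features], train_df[target],
    eval_set=[(valid_df[features], valid_df[target])],
    eval_metric="multi_logloss",
)

y_pred = model.predict(valid_df[features])
print(f1_score(valid_df[target], y_pred, average="macro"))

# Feature importances behind the bar plot mentioned above
for name, imp in sorted(zip(features, model.feature_importances_), key=lambda t: -t[1]):
    print(name, imp)
```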



Code available on GitHub